SUMMARY


A  partial  least  squares  treatment  of  multivariate  data 
related  through  a  complex  model  allows  one  to  evaluate  the 
interactions  between  large  numbers  of  features  at  once. 

Results for a model of water sources flowing together, with
each block composed of water quality data, allow the influence
of each source on the resulting downstream flow to be
evaluated.



When the goal of a study is to understand the
interrelationship among several parts of a complex system,
statistical  procedures  are  often  employed  to  analyse  features 
from  sets  of  samples  collectively  used  to  represent  each  part. 
All  too  often,  the  number  of  features  and/or  parts  is  larger 
than  the  number  of  samples  and  many  multivariate  statistical 
procedures  fail  to  be  useful.  A  simple  example  is  the  case 
where  one  set  of  independent  features  is  to  be  related  to  only 
one  dependent  feature  by  multiple  regression  analysis, 
represented  as  Model  I  in  Figure  1.  The  calculation  can  give  a 
perfect  but  possibly  meaningless  fit  if  the  number  of  features 
is  greater  than  the  number  of  samples.  For  the  establishment 
of  a  predictive  model  this  problem  is  normally  overcome  by  the 
use of stepwise regression analysis. However, in this analysis
the regression coefficients are uninformative with respect to
our understanding of the model and the results provide no
information  about  the  utility  of  the  omitted  features,  which 
may  be  only  a  little  less  informative  than  those  chosen  to 
provide  the  best  fit. 
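
The overfitting problem described above can be illustrated with a minimal sketch. It uses random numbers, not data from this study; the sizes (25 samples, 44 features) and all variable names are assumptions chosen for illustration. Because the design matrix has more columns than rows, ordinary least squares reproduces a purely random response exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 25, 44   # illustrative sizes; the data are random

X = rng.normal(size=(n_samples, n_features))
y = rng.normal(size=n_samples)   # pure noise, unrelated to X by construction

# Ordinary least squares with an intercept column.
A = np.column_stack([np.ones(n_samples), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

residual = y - A @ coef
r_squared = 1.0 - (residual @ residual) / np.sum((y - y.mean()) ** 2)
print(f"R^2 on pure noise: {r_squared:.6f}")
```

The fit is perfect even though no real relationship exists, which is why the coefficients of such a regression carry no interpretive weight.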

Consider the case where multiple blocks of data, each
block consisting of several features obtained over several
samples, are to be interrelated by a complex scheme or path
model.  When  only  one  block  of  features  is  to  be  related  to  a 
second block of features, shown as Model II in Figure 1, a
canonical correlation analysis [1] or target-transformation
analysis [2] can be carried out. For more than two blocks of
data  various  multidimensional  scaling  techniques  have  been 
developed  [3]  which  relate  blocks  of  features  along  axes 
preserving  the  maximum  amount  of  all  interblock  information  at 
once. However, when not all interconnections between blocks
are  desired  or  relevant,  more  flexible  methodology  is  required. 

This  new  methodology,  herein  called  the  PLS  (Partial  Least 
Squares)  approach  to  Path  Modelling  using  Latent  Variables,  has 
recently  been  developed  by  H.  IJold  [4-8]  .  This  important  new 
tool  allows  blocks  of  features  to  be  represented  by 
unobserveables  or  "latent"  variables  indirectly  observed.  The 
latent  variables  are  then  related  to  one  another  by  a  path  or 
interconnection  scheme  predetermined  by  the  user.  Tlie  latent 
variables  are  found  by  an  Iterative  procedure  involving  simple 
and  multiple  regression  analysis  so  that  they  simultaneously 
and  optimally  (in  the  PLS  sense)  represent  the  measured 
features  and  provide  the  best  fit  to  the  path  model.  The 
method  is  so  general  that  principal  component  analysis, 
multiple  regression  analysis,  and  canonical  correlation 
analysis  are  included  as  special  cases.  The  first  application 
of  this  method  to  the  physical  sciences,  an  analysis  of  water 
chemistry measurements to assess the environmental impact of
mine  spoils  drainage,  is  reported  here. 

In order to understand the impact of coal mining on local
water quality, R. Skogerboe et al. [9] monitored several water
quality parameters at numerous sites on Trout Creek in
Colorado. Data taken at monthly intervals from October 1973 to
July 1976 were provided by Skogerboe [10] for this study. Five
sites best characterized the environmental impact and were
selected for our present analysis. Site 1 is upstream from
runoff influenced by spoils of the Midway Edna Coal Mine, which
is adjacent to the stream. Sites 2, 3, and 4 monitor the
runoff from strip mine spoils representing mining activity from
the 1930s to the 1940s, the 1940s to the 1950s, and the 1960s
to the present, respectively. Runoff from these sites enters
the stream in the order given above. Site 5 is downstream from
the mine. Only 25 months of data were included in this study
since occasionally several features at a site were not
determined in certain months. At each site the data set was
composed of eleven features, pH, Cl⁻, SO₄²⁻, Ca²⁺, Fe, K⁺,
Mg²⁺, Mn²⁺, Na⁺, Zn²⁺, and HCO₃⁻, all but pH reported in mg/l.
The final data set had approximately eight percent of its
values missing, which we filled in so as to minimize any
deviation from a particular site's known data structure [11].

Our  goal  was  to  establish  a  path  model  using  all  five 
sites.  Each  site,  represented  by  a  data  matrix  of  11  features 
sampled over 25 months, was used in the model as a separate
entity. In our present case the path model is clearly that
shown  as  Model  III  in  Figure  1.  The  only  relationship  possible 
is  that  site  1,  the  upstream  site,  and  sites  2,  3,  and  4  mix  to 
form  site  5,  the  downstream  site. 

In order to consider the effect of all features at once
the method forms latent variables,

$$L_k = \sum_{i=1}^{n_k} a_{k,i} \, x_{k,i}$$

at each site, where $n_k$ is the number of features being
considered at site k, $x_{k,i}$ is the value of feature i, and the
$a_{k,i}$'s are coefficients determined in the course of the
analysis. The $a_{k,i}$'s for each of the upstream sites are
estimated from a multiple regression of all the features at a
particular site to the downstream latent variable, $L_5$, as
diagramed in Model III of Figure 1. All coefficients $a_{k,i}$ are
then scaled so that the latent variables $L_k$ have unit variance.
Next, $L_5$ is regressed upon the upstream latent variables to
estimate the $p_{k,5}$'s in the expression

$$L_5 = \sum_{k=1}^{4} p_{k,5} \, L_k$$

Using the $p_{k,5}$'s and $L_k$'s to estimate $L_5$ we perform a
multiple regression of the features of site 5 on it in order to
estimate the $a_{5,i}$'s. From the newly found $a_{5,i}$ we form a
new $L_5$ which is scaled to unit variance and the entire
procedure is repeated until all $a_{k,i}$ and $p_{k,5}$ converge.
All calculations were initiated with all $a_{k,i}$ and $p_{k,5}$ set
to one. A similar series of path models can be developed to
analyse any number of blocks of variables connected by any set
of paths.
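
The iteration described above can be sketched as follows. This is one possible reading of the procedure, not the authors' program: the function names, the convergence test on the site 5 weights, the tolerance, and the synthetic data are all assumptions. In particular, the "regression of the features of site 5" on the estimated downstream latent variable is read here as a least-squares regression of that estimate on the site 5 features. Only the starting weights for site 5 matter in this sketch, since the upstream weights and path coefficients are recomputed before they are first used.

```python
import numpy as np

def regress(X, y):
    """Least-squares coefficients for y ~ X (X and y assumed centered)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def scale_weights(X, a):
    """Rescale weights so the latent variable X @ a has unit variance."""
    return a / (X @ a).std()

def upstream_step(blocks, L5):
    """Upstream weights, upstream latent variables, and path coefficients."""
    weights = [scale_weights(X, regress(X, L5)) for X in blocks]
    L = np.column_stack([X @ a for X, a in zip(blocks, weights)])
    return weights, L, regress(L, L5)

def pls_path_model(upstream, downstream, tol=1e-10, max_iter=500):
    """Hypothetical sketch of the path model with several upstream blocks."""
    blocks = [X - X.mean(axis=0) for X in upstream]
    X5 = downstream - downstream.mean(axis=0)
    a5 = scale_weights(X5, np.ones(X5.shape[1]))   # start with weights of one
    for n_iter in range(1, max_iter + 1):
        _, L, p = upstream_step(blocks, X5 @ a5)
        # Regress the path-model estimate of L5 on the site 5 features.
        a5_new = scale_weights(X5, regress(X5, L @ p))
        converged = np.max(np.abs(a5_new - a5)) < tol
        a5 = a5_new
        if converged:
            break
    L5 = X5 @ a5
    weights, L, p = upstream_step(blocks, L5)
    r = L.T @ L5 / len(L5)   # correlations; every latent variable is standardized
    return {"a": weights + [a5], "p": p, "r": r,
            "contrib": p * r, "R2": float(np.sum(p * r)), "n_iter": n_iter}

# Synthetic stand-in for five sites, 25 months, 11 features (not the real data).
rng = np.random.default_rng(1)
upstream = [rng.normal(size=(25, 11)) for _ in range(4)]
downstream = 0.7 * upstream[1] + 0.3 * upstream[3] + 0.05 * rng.normal(size=(25, 11))
result = pls_path_model(upstream, downstream)
print(np.round(result["contrib"], 2), round(result["R2"], 2))
```

The returned `contrib` array holds one product of path coefficient and correlation per upstream block, which is the form in which site contributions are reported in Table 1.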

Using all 11 features in each block, the calculation of
Model III converged with an overall fit of 0.99. The square of
the fit correlation coefficient, $R^2$, gives the relative amount
of information at $L_5$ accounted for by the other four latent
variables and is calculated from

$$R^2 = \sum_{k=1}^{4} p_{k,5} \, r_{k,5}$$

where $r_{k,5}$ is the correlation between $L_k$ and $L_5$. The site
contributions to $R^2$ are given in Table 1. We note that the
good fit is primarily due to a strong relation between sites 4
and 5. The contributions of each individual feature to the fit
were calculated and showed that the high correlation was due
largely to a fit between HCO₃⁻ at site 4 and Ca²⁺ and Mg²⁺ at
site 5. Although only a small amount of the total variance in
all of the data is accounted for by this relationship, it is a
rather striking one as HCO₃⁻ introduced by site 4 strongly
buffers the Ca²⁺ and Mg²⁺ concentration.
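
The decomposition of the squared fit into site contributions follows from standard least-squares algebra. The short derivation below is a sketch under the assumptions that every latent variable is centered with unit variance and that the path coefficients come from an ordinary least-squares regression; the symbols $\hat{L}_5$ (fitted value) and $e$ (residual) are introduced here for illustration, with $p_{k,5}$ the path coefficient and $r_{k,5}$ the correlation between the latent variables of site k and site 5.

```latex
\begin{align*}
\hat{L}_5 &= \sum_{k=1}^{4} p_{k,5}\, L_k , \qquad e = L_5 - \hat{L}_5 , \qquad
  \operatorname{cov}(\hat{L}_5, e) = 0 \\
R^2 &= \frac{\operatorname{var}(\hat{L}_5)}{\operatorname{var}(L_5)}
     = \operatorname{cov}(\hat{L}_5,\, L_5 - e)
     = \operatorname{cov}(\hat{L}_5,\, L_5) \\
    &= \sum_{k=1}^{4} p_{k,5}\, \operatorname{cov}(L_k, L_5)
     = \sum_{k=1}^{4} p_{k,5}\, r_{k,5}
\end{align*}
```

An individual term can be negative when a path coefficient and its correlation have opposite signs, so the terms describe how the fit is apportioned among sites rather than independent fractions of variance.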




A principal component analysis of the features at site 5
yielded two readily interpretable components. The first
component represented the major salt load (Ca²⁺, Mg²⁺, Na⁺, K⁺,
SO₄²⁻, and Cl⁻) on the creek and the second component
represented primarily the trace metals zinc and manganese.

Thus,  a  more  directed  analysis  targeting  on  the  principal 
components  was  suggested.  Results  of  Model  III  calculations 
where $L_5$ is represented by an individual principal component
are  also  shown  in  Table  1.  The  first  component  is  modeled  by 
the  upstream  values  of  site  1  and  the  first  source  of  mine 
drainage  represented  by  site  2.  These  results  indicate  that 
site  2  has  by  far  the  most  dramatic  effect  on  water  quality. 
Similar  results  were  obtained  for  the  second  principal 
component  with  an  additional  smaller  contribution  from  site  4. 
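
A principal component of the downstream block can be turned into a fixed target latent variable along the following lines. This is a hedged sketch: the text does not say how the features were scaled before the principal component analysis, so scaling each feature to zero mean and unit variance is an assumption, as are the function name and the random stand-in data.

```python
import numpy as np

def principal_component_scores(X, n_components=2):
    """Unit-variance scores, loadings, and explained-variance fractions."""
    # Autoscaling each feature is an assumption, not stated in the text.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    scores = scores / scores.std(axis=0)       # each column can serve as a target
    explained = s ** 2 / np.sum(s ** 2)
    return scores, Vt[:n_components], explained[:n_components]

# Random stand-in for a 25-month by 11-feature site matrix (not the real data).
rng = np.random.default_rng(2)
X5 = rng.normal(size=(25, 11))
scores, loadings, explained = principal_component_scores(X5)
print(scores.shape, loadings.shape, np.round(explained, 3))
```

Each column of `scores` could then stand in for the downstream latent variable in a single pass of the upstream regressions, with the loadings indicating which features the component represents.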

We have also performed Model III calculations when
$L_5$ represents only one of the features from site 5, a
noniterative calculation. An example using Cl⁻ is also shown in
Table 1. Though the concentrations of Cl⁻ and the other major
species at sites 2, 3, and 4 are comparable in magnitude [9],
drainage from site 2 is obviously the dominant influence on the
downstream Cl⁻ concentration. Drainage represented by site 4
also perturbs the downstream Cl⁻ concentration, most likely
because it represents flow from the newest spoils, which have a
greater concentration of the more soluble salts. The lack of
influence from site 3 shows that drainage by this site is not
different enough or large enough to alter the Cl⁻ composition
set at site 2.

From  the  above  it  is  clear  that  quantitative  estimates  of 
the  effect  of  stream  components  contributing  to  the  load  at  the 
downstream  site  can  be  made.  In  addition,  detailed  information 
can  be  obtained  on  each  component.  For  example,  for  many 
species which have a high concentration at an upstream site but
fail to be used in modelling the downstream site, we believe
some form of buffering or precipitation action may be taking
place. In these cases the PLS analysis shows where more
extensive  investigation  should  be  directed  if  the  stream 
chemistry  is  to  be  fully  understood.  Conclusions  we  have 
arrived  at  using  the  PLS  path  modelling  scheme  are  compatible 
with  those  obtained  in  our  laboratory  using  a  battery  of 
standard  multivariate  techniques  on  a  more  extensive  data  set 
of  which  the  present  data  is  a  subset. 

The  above  results  show  how  PLS  path  modelling  using  latent 
variables  can  provide  insight  into  the  interrelationships 
between  groups  of  features.  It  is  especially  important  to  note 
that  the  treatment  of  groups  of  features  as  a  unit  allows  one 
to  include  many  more  features  in  the  analysis  than  would 
normally  be  allowed  by  more  conventional  techniques  when  one  is 
confronted  with  limited  quantities  of  data.  In  all  the  above 
calculations we have considered 44 features in sites 1 through
4  and  obtained  consistently  interpretable  results  with  only  25 
sets  of  data.  This  form  of  analysis  can  be  a  powerful  aid  to 
anyone  confronted  with  blocks  of  features  which  are  related  to 
one  another  along  a  set  of  logical  paths. 


This  work  was  partially  supported  by  the  Office  of  Naval 
Research. 

Fig. 1. Model I represents a multiple regression analysis of
one  matrix  onto  a  single  feature,  Model  II  depicts  two  matrices 
of  features  related  to  one  another,  and  Model  III  shows  the 
particular  multi-matrix  path  model  dealt  with  through  a  partial 
least  squares  analysis.  In  Model  III  the  4  matrices  on  the 
left  represent  sources  of  flow  in  a  watershed  which  combine  to 
form  the  flow  represented  by  the  fifth  matrix. 

Table 1. $(p_{k,5}) \times (r_{k,5})$ values for sites 1 through 4
and the corresponding $R^2$ for models where $L_5$ is described
in column 1. PCs are principal components.

L_5            Site 1   Site 2   Site 3   Site 4    R^2
11 Features     0.02    -0.04     0.06     0.93    0.97
PC 1            0.35     0.69    -0.16     0.03    0.91
PC 2            0.21     0.59     0.00     0.11    0.91
Cl⁻             0.09     0.58    -0.08     0.29    0.88
